Setting up Data and Functions

We start off by reading the data from the wineQualityReds.csv file and storing it in a variable, wineQualityData.

Data Summary

wineQualityData <- read.csv('wineQualityReds.csv', header = TRUE)
summary(wineQualityData)
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Data: We have 1599 rows of data, where X is the unique identifier for each wine. There are 11 measured properties that we relate to the quality of the wine. Quality is an ordered variable whose values range from 3 to 8 for our given set of wines. The mean wine quality is 5.6.
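Because quality is an ordered rating rather than a continuous measurement, it can help to store it as an ordered factor and tabulate it. The following is a minimal sketch on a small made-up data frame; `wine_demo` and its values are invented for illustration and are not the real dataset.

```r
# Hypothetical stand-in for wineQualityData; the values are invented.
wine_demo <- data.frame(
  X       = 1:8,
  alcohol = c(9.4, 9.8, 10.5, 11.2, 9.5, 12.8, 10.0, 13.1),
  quality = c(5, 5, 6, 6, 5, 7, 3, 8)
)

# Store quality as an ordered factor so that 3 < 4 < ... < 8 is explicit.
wine_demo$quality_ord <- factor(wine_demo$quality, levels = 3:8, ordered = TRUE)

# Count the wines at each rating, including ratings with no wines.
quality_counts <- table(wine_demo$quality_ord)
print(quality_counts)

# X is only a row identifier, so it is left out of the list of properties.
predictors <- setdiff(names(wine_demo), c("X", "quality", "quality_ord"))
print(predictors)
```

The same calls should work on wineQualityData, where the table of counts gives the numbers behind a bar chart of quality.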

Defining Plotting Function

library(ggplot2)
library(gridExtra)

# Boxplot and histogram of one property, side by side, restricted to
# the range [lower_limit, higher_limit].
plot_univariate <- function(property, lower_limit, higher_limit, bin_width) {
  grid.arrange(ggplot(wineQualityData, aes(x = 1, y = property)) + 
               geom_boxplot(color = 'black', fill = 'steelblue') + 
               scale_y_continuous(limits = c(lower_limit, higher_limit)),
             ggplot(data = wineQualityData, aes(x = property)) + 
               geom_histogram(binwidth = bin_width, color = 'black', fill = 'tan1') +  
               scale_x_continuous(limits = c(lower_limit, higher_limit)),
             ncol = 2)
}

# Scatter plot (by row index) and per-quality boxplot of one property.
plot_bivariate_wrt_quality <- function(property) {
  # Row index used as the x-axis of the scatter plot
  x <- seq_along(property)
  grid.arrange(
    ggplot(wineQualityData, aes(x = x, y = property,
                                color = factor(wineQualityData$quality),
                                shape = factor(wineQualityData$quality))) +
      geom_point(size = 2, alpha = 0.4) +
      scale_color_identity(guide = 'legend') +
      ylab('Property') +
      xlab('#'),
    ggplot(wineQualityData, aes(x = factor(wineQualityData$quality), y = property)) +
      geom_boxplot(),
    nrow = 2)
}

We start off by exploring each variable on its own to understand its distribution, before relating the attributes to the quality of a wine.

Univariate Data Analysis

Distribution of Quality

ggplot(wineQualityData, aes(x = factor(wineQualityData$quality))) + 
    geom_bar(stat = "count", width = 0.5, fill = "steelblue", color = 'black') + 
    xlab('Quality') + 
    ylab('Count') + 
    theme_minimal()

The majority of the wines have a quality of 5 or 6, with very few wines being really good or bad (8 or 3 respectively).

Distribution of Each Property

grid.arrange(qplot(wineQualityData$fixed.acidity),
             qplot(wineQualityData$volatile.acidity),
             qplot(wineQualityData$citric.acid),
             qplot(wineQualityData$residual.sugar),
             qplot(wineQualityData$chlorides),
             qplot(wineQualityData$free.sulfur.dioxide),
             qplot(wineQualityData$total.sulfur.dioxide),
             qplot(wineQualityData$density),
             qplot(wineQualityData$pH),
             qplot(wineQualityData$sulphates),
             qplot(wineQualityData$alcohol))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We analyze each property individually.

1. Fixed Acidity

From the above plot, it appears that the majority of the values for fixed acidity lie in the range 5 to 14, so we limit our fixed acidity values to this range.

plot_univariate(wineQualityData$fixed.acidity, 5, 14, 1)
## Warning: Removed 11 rows containing non-finite values (stat_boxplot).
## Warning: Removed 11 rows containing non-finite values (stat_bin).

The median for fixed acidity is somewhere around 8 and the distribution is positively skewed. A large number of values lie in the range of 7 to 9.
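The "Removed ... rows" warnings appear because ggplot2 scale limits discard values that fall outside the limits before the boxplot and histogram statistics are computed. The following sketch counts such values in base R; `count_outside` is a hypothetical helper and `demo_values` is invented data, not taken from the real dataset.

```r
# Hypothetical helper: count values that a [lower, upper] axis limit would drop.
count_outside <- function(values, lower, upper) {
  sum(values < lower | values > upper, na.rm = TRUE)
}

# Invented fixed-acidity-like values, not taken from the real dataset.
demo_values <- c(4.6, 7.1, 7.9, 8.3, 9.2, 14.0, 15.9)

# Values exactly on a limit are kept; only 4.6 and 15.9 fall outside [5, 14].
dropped <- count_outside(demo_values, 5, 14)
print(dropped)
```

Applied to the real data, a call such as count_outside(wineQualityData$fixed.acidity, 5, 14) should reproduce the row count reported in the warnings. Because the dropped rows are excluded from the statistics, the medians read from the limited plots can differ slightly from those in the summary table.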

2. Volatile Acidity

The majority of the values for volatile acidity lie in the range of 0.2 to 1.

plot_univariate(wineQualityData$volatile.acidity, 0.2, 1, 0.1)
## Warning: Removed 38 rows containing non-finite values (stat_boxplot).
## Warning: Removed 38 rows containing non-finite values (stat_bin).

The median is a little above 0.5 (0.52 for the full dataset, per the summary above) and this distribution is also positively skewed.

3. Citric Acid

A lot of citric acid values appear to be zero. The data available for citric acid might be incomplete.

plot_univariate(wineQualityData$citric.acid, 0, 0.75, 0.1)
## Warning: Removed 6 rows containing non-finite values (stat_boxplot).
## Warning: Removed 6 rows containing non-finite values (stat_bin).

4. Residual Sugar

The distribution of residual sugar is heavily positively skewed: the values are concentrated on the left with a long tail to the right, and most of the data lies in the range 1 to 5.

plot_univariate(wineQualityData$residual.sugar, 1, 5, 0.5)
## Warning: Removed 86 rows containing non-finite values (stat_boxplot).
## Warning: Removed 86 rows containing non-finite values (stat_bin).

Even after filtering some outliers, the data is still positively skewed with a median around 2.25.
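A common way to handle a positively skewed variable is a log transform, which the correlation section of this report applies to residual sugar. The sketch below compares skewness before and after log10 on invented data; `sample_skewness` is a hypothetical helper and `demo_sugar` is not the real column.

```r
# Hypothetical helper: moment-based sample skewness (positive = long right tail).
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Invented right-skewed values: the exponential of an evenly spaced grid,
# so the values become symmetric again once the log is taken.
demo_sugar <- 2.2 * exp(seq(-1, 1, length.out = 101))

skew_raw <- sample_skewness(demo_sugar)
skew_log <- sample_skewness(log10(demo_sugar))
print(c(raw = skew_raw, log10 = skew_log))
```

On this constructed example the raw skewness is positive and the log10 skewness is essentially zero. The real residual sugar values need not become perfectly symmetric after the transform.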

5. Chlorides

The data for chlorides is similar to that of residual sugar. We consider the data that lies between 0.04 and 0.14.

plot_univariate(wineQualityData$chlorides, 0.04, 0.14, 0.01)
## Warning: Removed 81 rows containing non-finite values (stat_boxplot).
## Warning: Removed 81 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

The data for this range appears to be normally distributed with a few outliers. The median is around 0.08.

6. Free Sulfur Dioxide

Most of the values for free sulfur dioxide lie in the range of 0 to 35.

plot_univariate(wineQualityData$free.sulfur.dioxide, 0, 35, 2)
## Warning: Removed 77 rows containing non-finite values (stat_boxplot).
## Warning: Removed 77 rows containing non-finite values (stat_bin).

For this property we see a high peak around 7-8, while the median is around 13. The long tail of values in the high range pulls the median above the peak and gives the distribution its positive skew.

7. Total Sulfur Dioxide

Most of the values are in the range 0 to 100. Since free sulfur dioxide is a subset of total sulfur dioxide, we can expect to see a similar positively skewed graph for total sulfur dioxide.

plot_univariate(wineQualityData$total.sulfur.dioxide, 0, 100, 5)
## Warning: Removed 127 rows containing non-finite values (stat_boxplot).
## Warning: Removed 127 rows containing non-finite values (stat_bin).

Our expectation was correct in this case: we see a positively skewed graph with a high peak around 25, whereas the median is around 36. The values for total sulfur dioxide appear to follow a pattern similar to those of free sulfur dioxide.
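How closely the two sulfur dioxide measures move together can be checked directly rather than read off the histograms. The sketch below uses invented values; `demo_so2` and its numbers are assumptions for illustration only.

```r
# Invented values standing in for the two sulfur dioxide columns.
demo_so2 <- data.frame(
  free.sulfur.dioxide  = c(5, 7, 11, 15, 21, 34),
  total.sulfur.dioxide = c(17, 27, 29, 45, 56, 74)
)

# Free sulfur dioxide is part of the total, so it should never exceed it.
free_within_total <- all(demo_so2$free.sulfur.dioxide <= demo_so2$total.sulfur.dioxide)

# Pearson correlation between the two measures.
so2_cor <- cor(demo_so2$free.sulfur.dioxide, demo_so2$total.sulfur.dioxide)

print(free_within_total)
print(round(so2_cor, 3))
```

Running the same two checks on wineQualityData would show whether the similarity of the two histograms reflects a relationship at the level of individual wines.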

8. Density

The data for density appears to be approximately normally distributed.

plot_univariate(wineQualityData$density, quantile(wineQualityData$density, 0.025), quantile(wineQualityData$density, 0.975), 0.001)
## Warning: Removed 79 rows containing non-finite values (stat_boxplot).
## Warning: Removed 79 rows containing non-finite values (stat_bin).

Both the median and the mean are around 0.997, which indicates that the distribution is symmetric. Together with the bell-shaped histogram, this suggests that density is approximately normally distributed.
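A matching mean and median shows symmetry rather than normality, so a formal check can be useful. The sketch below uses the Shapiro-Wilk test (`shapiro.test` from the stats package) on simulated samples; `demo_normal` and `demo_uniform` and their parameters are invented, not derived from the real density column.

```r
set.seed(42)

# Simulated stand-ins: one normal sample and one symmetric but flat (uniform) sample.
demo_normal  <- rnorm(500, mean = 0.9967, sd = 0.0019)
demo_uniform <- runif(500, min = 0.990, max = 1.004)

# The uniform sample is symmetric, so its mean and median nearly coincide
# even though it is clearly not normal.
print(c(mean = mean(demo_uniform), median = median(demo_uniform)))

# Shapiro-Wilk test: a small p-value is evidence against normality.
p_normal  <- shapiro.test(demo_normal)$p.value
p_uniform <- shapiro.test(demo_uniform)$p.value
print(c(normal = p_normal, uniform = p_uniform))
```

On the real data the call would be shapiro.test(wineQualityData$density). With 1599 rows even small departures from normality can produce a small p-value, so a Q-Q plot via qqnorm() is often a more informative companion to the histogram.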

9. pH

The data for pH level also appears to be approximately normally distributed.

plot_univariate(wineQualityData$pH, quantile(wineQualityData$pH, 0.025), quantile(wineQualityData$pH, 0.975), 0.05)
## Warning: Removed 80 rows containing non-finite values (stat_boxplot).
## Warning: Removed 80 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).

Both the median and the mean are around 3.3, which indicates that the distribution is symmetric. Together with the bell-shaped histogram, this suggests that pH is approximately normally distributed.

10. Sulphates

In this case we put our limits at 0.3 and 1.

plot_univariate(wineQualityData$sulphates, 0.3, 1, 0.05)
## Warning: Removed 58 rows containing non-finite values (stat_boxplot).
## Warning: Removed 58 rows containing non-finite values (stat_bin).

11. Alcohol

Most of the wines have an alcohol percentage of around 9 to 11%, which is typical, with a few values going up to 13%.

plot_univariate(wineQualityData$alcohol, 9, 13, 0.5)
## Warning: Removed 30 rows containing non-finite values (stat_boxplot).
## Warning: Removed 30 rows containing non-finite values (stat_bin).

This graph is positively skewed with a median around 10.2, which is expected because most of the wines have their alcohol percentage in the 9% to 11% range.

Bivariate Data Analysis

Property vs Quality

Density vs Quality

plot_bivariate_wrt_quality(wineQualityData$density)

From the above plots we can see that wines with higher quality have a lower median density. This points to a weak negative correlation between the quality and the density of a wine.
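The medians that the boxplot draws can also be computed as numbers, which makes the trend easier to state. The sketch below uses `aggregate` on invented rows; `demo` and its values are assumptions, not the real data.

```r
# Invented rows standing in for wineQualityData.
demo <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  density = c(0.9978, 0.9970, 0.9982, 0.9968, 0.9972, 0.9960,
              0.9955, 0.9962, 0.9958)
)

# Median density within each quality rating (the midline of each box).
median_by_quality <- aggregate(density ~ quality, data = demo, FUN = median)
print(median_by_quality)
```

Replacing `demo` with wineQualityData, and `density` with any other property, gives the per-quality medians behind each of the boxplots in this section.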

Alcohol vs Quality

plot_bivariate_wrt_quality(wineQualityData$alcohol)

Higher quality wines in the dataset have higher alcohol content on average as compared to the lower quality ones. There is a positive correlation between alcohol and quality.

pH Level vs Quality

plot_bivariate_wrt_quality(wineQualityData$pH)

Wines are generally acidic in nature, which explains why all pH levels are well below 7 (which is neutral). We can observe that most wines have a pH level within the range 3 to 4, and there is a slight negative correlation with quality.

Residual Sugar vs Quality

plot_bivariate_wrt_quality(wineQualityData$residual.sugar)

There are many outliers for the residual sugar property. Let’s filter out the outliers and plot the values.

x = seq(wineQualityData$residual.sugar)
grid.arrange(ggplot(wineQualityData, aes(x = x, y = wineQualityData$residual.sugar, color = factor(wineQualityData$quality), shape = factor(wineQualityData$quality))) + 
               geom_point(size = 2, alpha = 0.4) + 
               scale_color_identity(guide = 'legend') + 
               ylab('Property') + 
               ylim(0, 6) + 
               xlab('#'),
    ggplot(wineQualityData, aes(x = factor(wineQualityData$quality), y = wineQualityData$residual.sugar)) + 
      geom_boxplot() + 
      ylim(0, 6),
    nrow = 2)
## Warning: Removed 48 rows containing missing values (geom_point).
## Warning: Removed 48 rows containing non-finite values (stat_boxplot).

The residual sugar content is almost the same for all qualities of wine.

Sulphates vs Quality

plot_bivariate_wrt_quality(wineQualityData$sulphates)

It looks like sulphates also has a lot of outliers; however, we can observe a positive correlation from the boxplot. Let’s have a closer look.

x = seq(wineQualityData$sulphates)
grid.arrange(ggplot(wineQualityData, aes(x = x, y = wineQualityData$sulphates, color = factor(wineQualityData$quality), shape = factor(wineQualityData$quality))) + 
               geom_point(size = 2, alpha = 0.4) + 
               scale_color_identity(guide = 'legend') + 
               ylab('Property') + 
               ylim(0, 1) + 
               xlab('#'),
    ggplot(wineQualityData, aes(x = factor(wineQualityData$quality), y = wineQualityData$sulphates)) + 
      geom_boxplot() + 
      ylim(0, 1),
    nrow = 2)
## Warning: Removed 58 rows containing missing values (geom_point).
## Warning: Removed 58 rows containing non-finite values (stat_boxplot).

Yes, our observation was correct: better quality wines have a higher sulphates content.

Correlation

correlations <- c(
  cor.test(wineQualityData$fixed.acidity, as.numeric(wineQualityData$quality))$estimate,
  cor.test(wineQualityData$volatile.acidity, as.numeric(wineQualityData$quality))$estimate,
  cor.test(wineQualityData$citric.acid, as.numeric(wineQualityData$quality))$estimate,
  cor.test(log10(wineQualityData$residual.sugar), as.numeric(wineQualityData$quality))$estimate,
  cor.test(log10(wineQualityData$chlorides), as.numeric(wineQualityData$quality))$estimate,
  cor.test(wineQualityData$free.sulfur.dioxide, as.numeric(wineQualityData$quality))$estimate,
  cor.test(wineQualityData$total.sulfur.dioxide, as.numeric(wineQualityData$quality))$estimate,
  cor.test(wineQualityData$density, as.numeric(wineQualityData$quality))$estimate,
  cor.test(wineQualityData$pH, as.numeric(wineQualityData$quality))$estimate,
  cor.test(log10(wineQualityData$sulphates), as.numeric(wineQualityData$quality))$estimate,
  cor.test(wineQualityData$alcohol, as.numeric(wineQualityData$quality))$estimate)
  
names(correlations) <- c('fixed.acidity', 'volatile.acidity', 'citric.acid',
                         'log10.residual.sugar',
                         'log10.chlorides', 'free.sulfur.dioxide',
                         'total.sulfur.dioxide', 'density', 'pH',
                         'log10.sulphates', 'alcohol')

correlations
##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
## log10.residual.sugar      log10.chlorides  free.sulfur.dioxide 
##           0.02353331          -0.17613996          -0.05065606 
## total.sulfur.dioxide              density                   pH 
##          -0.18510029          -0.17491923          -0.05773139 
##      log10.sulphates              alcohol 
##           0.30864193           0.47616632

From the above values we can say that alcohol, volatile acidity and sulphates have the strongest correlations with quality. We already observed that alcohol and sulphates have a positive correlation with quality. Let’s have a look at volatile acidity vs quality.
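The table of correlations can also be built more compactly by looping over the columns, and a rank correlation can be added because quality is an ordinal rating. The sketch below uses invented rows; `demo` and its values are assumptions, and using Spearman here is a suggestion rather than part of the original analysis.

```r
# Invented rows standing in for wineQualityData.
demo <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 9.5, 12.8, 10.0, 13.1),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.28, 0.66, 0.35, 0.90, 0.30),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.82, 0.50, 0.75),
  quality          = c(5, 5, 6, 6, 5, 7, 3, 8)
)

predictors <- setdiff(names(demo), "quality")

# Pearson correlation of every predictor with quality.
pearson <- sapply(demo[predictors], cor, y = demo$quality)

# Spearman rank correlation, which only uses the ordering of the ratings.
spearman <- sapply(demo[predictors], cor, y = demo$quality, method = "spearman")

# Sort by absolute strength so the most strongly related properties come first.
print(round(pearson[order(-abs(pearson))], 3))
print(round(spearman[order(-abs(spearman))], 3))
```

Sorting by absolute value makes it easy to see which properties stand out regardless of the sign of the relationship.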

Volatile Acidity vs Quality

plot_bivariate_wrt_quality(wineQualityData$volatile.acidity)

Volatile acidity has a moderate negative correlation (about -0.39) with wine quality, the strongest negative correlation among the properties.

Multivariate Analysis

In the previous section we observed which properties are most strongly associated with the quality of wines. Let’s have a look at how combinations of these factors relate to quality.

ggplot(wineQualityData, aes(x = wineQualityData$volatile.acidity, y = wineQualityData$alcohol, color = factor(wineQualityData$quality))) + 
  geom_point(alpha = 0.5) + 
  scale_color_identity(guide = 'legend')

The above graph shows that wines with higher alcohol content and lower volatile acidity tend to have higher quality rating.
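The pattern in the scatter plot can be summarised numerically by splitting the wines at the median of each property and comparing the mean quality of the four groups. The sketch below uses invented rows; `demo` and its values are assumptions, and the median split is only one of several reasonable ways to group the data.

```r
# Invented rows standing in for wineQualityData.
demo <- data.frame(
  alcohol          = c(9.4, 9.6, 9.8, 10.0, 10.6, 11.0, 12.0, 12.8),
  volatile.acidity = c(0.80, 0.40, 0.70, 0.35, 0.75, 0.30, 0.65, 0.25),
  quality          = c(4, 5, 5, 6, 5, 7, 6, 7)
)

# Flag wines above the median alcohol and below the median volatile acidity.
high_alcohol <- demo$alcohol > median(demo$alcohol)
low_va       <- demo$volatile.acidity < median(demo$volatile.acidity)

# Mean quality in each of the four combinations (rows: alcohol, columns: acidity).
mean_quality <- tapply(demo$quality,
                       list(high_alcohol = high_alcohol,
                            low_volatile_acidity = low_va),
                       mean)
print(mean_quality)
```

In this constructed example the group with high alcohol and low volatile acidity has the highest mean quality; whether the real data shows the same ordering would need to be checked on wineQualityData.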

ggplot(wineQualityData, aes(x = wineQualityData$sulphates, y = wineQualityData$alcohol, color = factor(wineQualityData$quality))) + 
  geom_point(alpha = 0.5) + 
  scale_color_identity(guide = 'legend')

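Consistent with the positive correlation found earlier, good quality wines tend to have somewhat higher sulphates levels, although the middle half of all wines lies in the narrow band between 0.55 and 0.73. Based on the past two observations, we can expect good quality wines in a graph of sulphates and volatile acidity to be prevalent toward the bottom of the graph, where volatile acidity is low. Let’s have a look.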

ggplot(wineQualityData, aes(x = wineQualityData$sulphates, y = wineQualityData$volatile.acidity, color = factor(wineQualityData$quality))) + 
  geom_point(alpha = 0.4) + 
  scale_color_identity(guide = 'legend')

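This graph stays true to our expectation. A lot of good quality wines lie in the lower part of the graph, where volatile acidity is low. Because a few sulphates outliers stretch the x-axis up to 2.0, this cluster sits toward the bottom left of the graph.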

Final Plots and Summary

Plot 1: Volatile Acidity vs Quality

This graph shows a moderate negative correlation (about -0.39) between wine quality and volatile acidity. The better the wine quality, the lower its volatile acidity tends to be.

ggplot(wineQualityData, aes(x = factor(wineQualityData$quality), y = wineQualityData$volatile.acidity)) + 
  geom_boxplot(color = 'black', fill = 'cadetblue3', alpha = 0.4) + 
  ylab('Volatile Acidity') + 
  xlab('Quality')

Plot 2: Alcohol vs Quality

We observed that alcohol content has the strongest positive correlation with quality (about 0.48). The following graph depicts that.

ggplot(wineQualityData, aes(x = factor(wineQualityData$quality), y = wineQualityData$alcohol)) + 
  geom_boxplot(color = 'black', fill = 'cadetblue3', alpha = 0.4) + 
  ylab('Alcohol') + 
  xlab('Quality')

Plot 3: Alcohol vs Volatile Acidity vs Quality

ggplot(wineQualityData, aes(x = wineQualityData$volatile.acidity, y = wineQualityData$alcohol, color = factor(wineQualityData$quality))) + 
  geom_point(alpha = 0.5) + 
  scale_color_identity(guide = 'legend') + 
  xlab('Volatile Acidity') + 
  ylab('Alcohol') + 
  labs(color = 'Quality')

The above plots suggest that volatile acidity and alcohol are the properties most strongly associated with the quality of a wine. Other factors such as sulphates, density and pH level are also related to wine quality to a lesser extent.

Reflections:

We were able to identify some properties that might be affecting the quality of a wine. However, our dataset only had 1599 different wines, all produced in a certain region of Portugal, which is a small sample compared with the large number of wines available in the market. Therefore our analysis does not necessarily apply to wines made in other countries. We also need to keep in mind that the ratings were given by a fixed group of individuals and, since taste differs from person to person, these ratings do not necessarily reflect the preferences of the entire populace.